About this activity

This module will introduce you to concepts and approaches to working with high-dimensional data — especially the kind you might encounter when studying gene expression in systems biology applications. Similar to clustering, principal component analysis (PCA) is an unsupervised learning approach that we can use to visualize and analyze data in reduced dimensions. We’ll show you how to calculate and inspect the outputs of PCA, and also how you can check for informative (or problematic) patterns.


Loading and inspecting the data

Run the chunk below to pre-load and format some data that you’ll be using for the activity below. Feel free to review the individual steps (hint: they’re similar to the transformations we applied in Module 1), but otherwise we’ll jump right into inspecting and working with the data!

load(file.path(data_dir, "pson_expr_tpm_df.RData"))
load(file.path(data_dir, "pson_expr_gene_info.RData"))
load(file.path(data_dir, "pson_motility_tidy_df.RData"))
pson_pca_inputs_file <- file.path(data_dir, "pson_pca_inputs.RData")
if (!file.exists(pson_pca_inputs_file)) {
  pson_expr_tpm_df2 <- merge(x = gene_df, y = pson_expr_tpm_df, 
                             by.x = "gene_id", by.y = "gene_id")
  pson_tpm_mat <- as.matrix(pson_expr_tpm_df2[, -c(1:5)])
  rownames(pson_tpm_mat) <- pson_expr_tpm_df2$symbol
  pson_logtpm_mat <- log2(1 + pson_tpm_mat)
  pson_pca_mat <- t(pson_logtpm_mat)
  save(list = c("pson_pca_mat", "pson_logtpm_mat"), 
       file = pson_pca_inputs_file)
} else {
  load(pson_pca_inputs_file)
}

Check out pson_expr_tpm_df:

# How do you see what is included in this data frame? Run the `str()` command. 
str(pson_expr_tpm_df) 
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   18682 obs. of  64 variables:
 $ gene_id : chr  "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" ...
 $ mRNA_R17: num  33.56 0 169.46 1.85 5.73 ...
 $ mRNA_R21: num  45.1 0 129.88 1.85 11.85 ...
 $ mRNA_R20: num  39.42 0 132.06 1.77 10.16 ...
 $ mRNA_R19: num  35.61 0 113.73 1.93 9.28 ...
 $ mRNA_R18: num  32.1 0 123.4 2.3 10 ...
 $ mRNA_R16: num  39.49 0 136.29 1.54 8.93 ...
 $ mRNA_R15: num  29.9 0 137.98 2 7.47 ...
 $ mRNA_R38: num  20.8 0 185.06 1.75 7.92 ...
 $ mRNA_R42: num  19.63 0 135.75 1.41 5.5 ...
 $ mRNA_R41: num  24.6 0 173.8 1.76 8.55 ...
 $ mRNA_R40: num  12.37 0 188.45 0.84 4.33 ...
 $ mRNA_R39: num  18.25 0 173.7 1.54 5.73 ...
 $ mRNA_R37: num  21.11 0 179.63 2.35 5.16 ...
 $ mRNA_R36: num  17.54 0 143.49 1.83 7.34 ...
 $ mRNA_R24: num  4.95 0 182.84 3.49 6.03 ...
 $ mRNA_R28: num  4.01 0 179.46 3.81 6.4 ...
 $ mRNA_R27: num  3.61 0 177.32 4.87 7.31 ...
 $ mRNA_R26: num  3.61 0 161.67 5.07 11.12 ...
 $ mRNA_R25: num  3.45 0 209.87 3.9 5.64 ...
 $ mRNA_R23: num  4.35 0 171.66 2.99 6.01 ...
 $ mRNA_R22: num  2.98 0 155.91 4.27 6.16 ...
 $ mRNA_R45: num  10.35 0 124.89 5.14 14.63 ...
 $ mRNA_R49: num  9.95 0 152.52 4.29 11.95 ...
 $ mRNA_R48: num  6.47 0 97.52 2.12 9.32 ...
 $ mRNA_R47: num  9.83 0 148.47 5.34 10.1 ...
 $ mRNA_R46: num  5.07 0 61.81 1.28 9.42 ...
 $ mRNA_R44: num  11.59 0 132.04 8.4 9.22 ...
 $ mRNA_R43: num  2.96 0 14.48 0.45 6.05 ...
 $ mRNA_R10: num  6.22 0 98.06 4.46 8.48 ...
 $ mRNA_R14: num  7.73 0 81.59 5.01 14.43 ...
 $ mRNA_R13: num  5.59 0 100.83 5.81 10.52 ...
 $ mRNA_R12: num  7.72 0 86.01 6.16 7.09 ...
 $ mRNA_R11: num  8.73 0 87.73 2.24 1.85 ...
 $ mRNA_R9 : num  7.69 0 75.97 5.24 7.89 ...
 $ mRNA_R8 : num  7.28 0 96.31 3.83 3.84 ...
 $ mRNA_R3 : num  5.32 0 75.27 6.71 12.05 ...
 $ mRNA_R7 : num  5 0 72.06 7.07 10.21 ...
 $ mRNA_R6 : num  4.42 0 68.96 5.51 8.96 ...
 $ mRNA_R5 : num  3.22 0 75.23 10.49 7.03 ...
 $ mRNA_R4 : num  4.13 0 80.33 7.89 10.64 ...
 $ mRNA_R2 : num  4.48 0 87.15 8.51 10.74 ...
 $ mRNA_R1 : num  4.64 0 72.45 8.67 10.77 ...
 $ mRNA_R52: num  17.47 0 122.39 9.73 9.19 ...
 $ mRNA_R56: num  17.69 0 123.47 8.77 10.26 ...
 $ mRNA_R55: num  17.81 0 121.18 8.84 8.29 ...
 $ mRNA_R54: num  11.23 0 111.65 12.03 9.24 ...
 $ mRNA_R53: num  13.3 0 127.9 10.2 11 ...
 $ mRNA_R51: num  14.6 0 116.6 11.1 10.1 ...
 $ mRNA_R50: num  13.59 0 124.17 11.11 9.93 ...
 $ mRNA_R31: num  19.42 0 118.91 2.92 4.11 ...
 $ mRNA_R35: num  19.69 0.04 124.45 3.78 4.16 ...
 $ mRNA_R34: num  18.56 0 124.31 3.34 6.52 ...
 $ mRNA_R33: num  17.74 0 110.52 2.53 4.97 ...
 $ mRNA_R32: num  16.08 0 134.33 3.27 4.38 ...
 $ mRNA_R30: num  18.37 0 107.29 3.66 3.99 ...
 $ mRNA_R29: num  17.42 0 134.78 2.96 2.83 ...
 $ mRNA_R59: num  16.09 0 101.79 4.33 16.98 ...
 $ mRNA_R63: num  15.69 0 105.81 5.21 16.1 ...
 $ mRNA_R62: num  16.39 0 110.63 5.92 11.9 ...
 $ mRNA_R61: num  11.46 0 91.56 3.56 17.67 ...
 $ mRNA_R60: num  9.38 0 85.66 3.49 13.37 ...
 $ mRNA_R58: num  14.81 0 100.57 4.09 19.29 ...
 $ mRNA_R57: num  9.84 0 70.69 4.4 12.16 ...
# What are the sample names in `pson_expr_tpm_df`?
colnames(pson_expr_tpm_df)[-1] 
 [1] "mRNA_R17" "mRNA_R21" "mRNA_R20" "mRNA_R19" "mRNA_R18" "mRNA_R16" "mRNA_R15" "mRNA_R38" "mRNA_R42" "mRNA_R41" "mRNA_R40"
[12] "mRNA_R39" "mRNA_R37" "mRNA_R36" "mRNA_R24" "mRNA_R28" "mRNA_R27" "mRNA_R26" "mRNA_R25" "mRNA_R23" "mRNA_R22" "mRNA_R45"
[23] "mRNA_R49" "mRNA_R48" "mRNA_R47" "mRNA_R46" "mRNA_R44" "mRNA_R43" "mRNA_R10" "mRNA_R14" "mRNA_R13" "mRNA_R12" "mRNA_R11"
[34] "mRNA_R9"  "mRNA_R8"  "mRNA_R3"  "mRNA_R7"  "mRNA_R6"  "mRNA_R5"  "mRNA_R4"  "mRNA_R2"  "mRNA_R1"  "mRNA_R52" "mRNA_R56"
[45] "mRNA_R55" "mRNA_R54" "mRNA_R53" "mRNA_R51" "mRNA_R50" "mRNA_R31" "mRNA_R35" "mRNA_R34" "mRNA_R33" "mRNA_R32" "mRNA_R30"
[56] "mRNA_R29" "mRNA_R59" "mRNA_R63" "mRNA_R62" "mRNA_R61" "mRNA_R60" "mRNA_R58" "mRNA_R57"
# How are genes identified?
pson_expr_tpm_df$gene_id[1:5] 
[1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" "ENSG00000000460"
# Andrew taught us many ways to inspect the data. Try this one.
View(pson_expr_tpm_df[1:1000, ]) # only view the first 1000 rows so we don't
                                 # make your browser too unhappy...

As you learned, the gene IDs are not very informative — they come from a database called Ensembl. More commonly, we would use the Hugo gene symbols (https://www.genenames.org/). These are the gene names you are more used to seeing (MYC, BRCA1, etc). We have this information stored in another file.

Check out the first few rows of gene_df:

# Run the command to view the first 6 rows of `gene_df`
head(gene_df) 

We now have a new variable called pson_pca_mat that we’ll use for the examples below. Take a look at the first few rows and columns of this matrix, and try to figure out what changes were made to go from pson_expr_tpm_df and gene_df to what you see. Compare to pson_logtpm_mat and see what else is different.

# Take a look at the first few rows and columns of `pson_pca_mat`
pson_pca_mat[1:5, 1:5] 
           TSPAN6 TNMD     DPM1    SCYL3 C1orf112
mRNA_R17 5.111031    0 7.413289 1.510962 2.750607
mRNA_R21 5.526695    0 7.032101 1.510962 3.683696
mRNA_R20 5.336997    0 7.055933 1.469886 3.480265
mRNA_R19 5.194166    0 6.842099 1.550901 3.361768
mRNA_R18 5.048759    0 6.958495 1.722466 3.460743
# Take a look at the first few rows and columns of `pson_logtpm_mat`
pson_logtpm_mat[1:5, 1:5] 
         mRNA_R17 mRNA_R21 mRNA_R20 mRNA_R19 mRNA_R18
TSPAN6   5.111031 5.526695 5.336997 5.194166 5.048759
TNMD     0.000000 0.000000 0.000000 0.000000 0.000000
DPM1     7.413289 7.032101 7.055933 6.842099 6.958495
SCYL3    1.510962 1.510962 1.469886 1.550901 1.722466
C1orf112 2.750607 3.683696 3.480265 3.361768 3.460743

Besides any normalization (e.g., “log” of expression values), translation (changing gene IDs to symbols), or format conversion (data frame to matrix), we’ve also transposed or rotated the data such that genes are now represented by columns and samples by rows. More on why we did this next!


Getting started with PCA with PS-ON cell line data

We will use the function prcomp() to run the principal component analysis (PCA) algorithm. We’re interested in checking out the variability and similarity of samples (cell lines) in our data based on their gene expression patterns. To do this, we need to give prcomp() a matrix as where the samples are the rows, and the features (transcript counts) are the columns. If we hadn’t transposed, PCA would still work — but it would instead be telling us about the relationship between genes.

Calculating PCA

We want to make sure that we can connect our expression data to motility measurements. I happen to know that for one sample in pson_pca_mat, we’re missing motility data (I cheating a bit and checked in advance, but it makes life easier given our time constraints).

# let's locate and remove the sample with no motility measurement data
pca_mat_samples <- rownames(pson_pca_mat)
missing_row <- which(!(pca_mat_samples %in% pson_motil_tidy_df$sample))
missing_row
[1] 8
pson_pca_mat <- pson_pca_mat[-missing_row, ]   

As described in our lecture, there are two additional things for us to consider:

  1. Whether the data should be mean-centered around 0.
  2. Whether the data should be scaled so that the standard deviation for each gene is 1.

Our features have the same units, and it is often the case that they do not need to be scaled; this is somewhat subjective. The prcomp() function automatically mean-centers and scales the data for us by default. Because we want to preserve the natural differences in expression variance between genes, we’ll elect not to scale in this case:

pson_pca <- prcomp(pson_pca_mat, scale = FALSE)

Exciting, right? But now we can get to our exploratory analysis. What goodies does pson_pca hold for us?

names(pson_pca)
[1] "sdev"     "rotation" "center"   "scale"    "x"       

We discussed these in the lecture. To summarize:

  • sdev: The standard deviation of the principal components
  • rotation: The matrix of feature loadings (weights)
  • center: The feature means that prcomp() subtracted from each feature
  • scale: The feature standard deviations used in scaling
  • x: Coordinates of the cell lines projected onto the principal components
dim(pson_pca$sdev)
NULL
dim(pson_pca$rotation)
[1] 18682    62
dim(pson_pca$center)
NULL
dim(pson_pca$scale)
NULL
dim(pson_pca$x)
[1] 62 62

Are these dimensions what you would expect?

pson_pca$rotation[1:5, 1:5]
                   PC1           PC2           PC3           PC4           PC5
TSPAN6    1.584158e-04 -4.610167e-03  5.191321e-03  3.180007e-03  1.729562e-02
TNMD      2.753341e-05  1.197326e-05 -2.777844e-06 -2.525294e-05  1.860216e-06
DPM1     -5.799012e-04 -1.316543e-03  7.183747e-03 -6.022465e-04 -1.600115e-03
SCYL3    -2.893989e-03  1.230555e-02 -1.404974e-03  7.094512e-04 -2.288287e-03
C1orf112 -2.880724e-03  1.562732e-03  9.381906e-04  1.681514e-03  2.043382e-03
pson_pca$x[1:5, 1:5]
               PC1       PC2       PC3         PC4      PC5
mRNA_R17 -39.74973 -76.79247  6.395282  -1.5573725 34.99499
mRNA_R21 -44.31163 -70.45641 16.540568  -5.2920620 50.47853
mRNA_R20 -45.41556 -80.75207  9.007610  -0.0308181 42.11339
mRNA_R19 -42.46163 -73.68480 13.031576  -3.3063448 45.94905
mRNA_R18 -37.95296 -62.22346 19.415076 -10.2662479 43.98570
length(pson_pca$sdev)
[1] 62
length(pson_pca$center)
[1] 18682
length(pson_pca$scale)
[1] 1

Check the values out! Why does scale have length 1?

In this module, we are interested in sdev to calculate the variation explained by the principal components, and x, which provides the coordinates of the samples (cell lines) onto the principal components.

Has PCA reduced the dimensionality of our data? Let’s examine the importance of the principal components by ploting the percentage of variance explained.

pve <- 100 * pson_pca$sdev^2 / sum(pson_pca$sdev^2)
par(mfrow = c(1, 2))
plot(pve,
     type = "o", cex = 0.8,
     ylab = "PVE", xlab = "Principal Component", col = "blue"
)
# `cumsum` is a handy function to calculate the "cumulative sum" of a vector;
# for example:
# > 1:5
# [1] 1 2 3 4 5
# > cumsum(1:5)
# [1]  1  3  6 10 15
plot(cumsum(pve),
     type = "o", cex = 0.8,
     ylab = "Cumulative PVE", xlab = "Principal Component", col = "brown3"
)

Visualizing patterns and groups with PCA

The first two PCs account for a lot of the variance, > 40%. The first 10 PCs account for almost all of the variance. Let’s see if the first two PCs, PC1 and PC2, reveal any patterns in our data.

plot(pson_pca$x[, 1:2],
     pch = 19, cex = 0.8,
     xlab = "Projection onto PC1", ylab = "Projection onto PC2"
)

Each point gives the coordinates of a sample/cell line in PC space. It’s a nice pattern with some potential. Does is have meaning for our data set? Let’s color the points by cell line type and diagnosis.

cell_speeds_df <- subset(pson_motil_tidy_df, summary_metric == "speed_um_hr")
names(cell_speeds_df)
 [1] "id_motil"                      "summary_metric"                "average_value"                 "standard_deviation"           
 [5] "standard_error"                "total_number_of_cells_tracked" "sample"                        "id_expr"                      
 [9] "diagnosis"                     "cellType"                      "catalogNumber"                 "cellLine"                     
[13] "experimentalCondition"         "stiffness"                     "stiffness_units"               "surface"                      
[17] "laminate"                      "organ"                         "tumorType"                     "stiffness_norm"               
[21] "average_value_scaled"         

We will color by cellLine and diagnosis, and it is possible to explore other relationships as well.

Here is a simple function for coloring points by type — we’ll use another one of the “Color Brewer” palettes that we learned about in Module 2. (note: running this next chunk won’t do anything by itself, but we’ll use the new color_by_type() function in several places below)

color_by_type <- function(vec) {
  nn <- length(unique(vec))
  cols <- RColorBrewer::brewer.pal(nn, "Set1")
  return(cols[as.numeric(as.factor(vec))])
}

First, let’s color the points in the PC2 vs. PC1 space according to diagnosis. We’ll adjust the transparency a bit with the alpha() function, such that overlapping points are easier to see.

colors <- color_by_type(unique(cell_speeds_df$diagnosis))
diag <- unique(cell_speeds_df$diagnosis)
plot(pson_pca$x[, c(1, 2)],
     col = scales::alpha(color_by_type(cell_speeds_df$diagnosis), 0.8),
     main = "Diagnosis", pch = 19, cex = 0.8
)
legend("bottomright", fill = colors, legend = diag, cex = 0.8)

TIMEOUT This is a fairly important point that we should take a moment to consider. Running PCA and seeing our data separate out so cleanly like this can be exciting, but it should make us pause and take a closer look. We need to check the history of how the data were generated to make sure that PCA patterns reflect biological differences and not batch effects. Can you think of other reasons why different cancer types might appear so distinct in (PC1, PC2) space?

OK — scary word of caution aside, let’s get back to the activity…

We can use the same approach to color points based on cell line name.

colors <- color_by_type(unique(cell_speeds_df$cellLine))
cell <- unique(cell_speeds_df$cellLine)
plot(pson_pca$x[, c(1, 2)],
     col = scales::alpha(color_by_type(cell_speeds_df$cellLine), 0.8),
     main = "Cell Line", pch = 19, cex = 0.8
)
legend("bottomright", fill = colors, legend = cell, cex = 0.8)

T-47D and MDA-MB-231, two breast cancer cell lines form distinct well-separated groups. Can you think of reasons (besides batch effects) why these points might not be grouped together? Note that the two colon cancer cell lines (SW620 and SW480) are grouped together.

Sometimes we can pick out patterns and information more easily by encoding multiple layers of data in our plot. For example, we can simultaneously look at cell line (color) and diagnosis (label) to more easily see how points relate to each other according to these variables.

colors <- color_by_type(unique(cell_speeds_df$cellLine))
cell <- unique(cell_speeds_df$cellLine)
plot(pson_pca$x[, c(1, 2)],
     main = "Cell Line", pch = 19, cex = 0
)
text(pson_pca$x[, c(1, 2)], 
     labels = gsub(" Cancer", "", cell_speeds_df$diagnosis),
     col = scales::alpha(color_by_type(cell_speeds_df$cellLine), 0.8),
     pos = 3, offset = 0, cex = 0.8)
legend("bottomright", fill = colors, legend = cell, cex = 0.8)

Again, we can see that the two breast cancer cell lines do NOT cluster together. In fact, the MDA-MB-231 cell line appears more closely related to several other cancer types than to the T-47D cell line — why might this be?


PCA with patient data from TCGA

Load the data

We’ll shortcut things here again, and let you run the chunk below to get the data in the right format for the next steps.

load(file.path(data_dir, "tcga_brca_expr_gene_info.RData"))
load(file.path(data_dir, "tcga_brca_expr_norm_df.RData"))
load(file.path(data_dir, "tcga_brca_clinical_df.RData"))
x1 <- names(brca_expr_norm_df)[-1]
x2 <- brca_clinical_df[["bcr_patient_barcode"]]
sample.names <- intersect(x1, x2)
sample.names <- sort(sample.names)
I <- match(sample.names, names(brca_expr_norm_df))
I <- c(1, I)
brca_expr_norm_df <- brca_expr_norm_df[, I]
I <- match(sample.names, brca_clinical_df[["bcr_patient_barcode"]])
brca_clinical_df <- brca_clinical_df[I, ]
brca_expr_norm_df2 <- merge(x = gene_df, y = brca_expr_norm_df,
                            by.x = "gene_id",  by.y = "gene_id")
brca_expr_mat <- as.matrix(brca_expr_norm_df2[, -c(1:5)])
rownames(brca_expr_mat) <- brca_expr_norm_df2$symbol
brca_log_mat <- log2(1 + brca_expr_mat)

Take a look at brca_log_mat:

brca_log_mat[1:5, 1:5]
         TCGA-3C-AAAU TCGA-3C-AALI TCGA-3C-AALJ TCGA-3C-AALK TCGA-4H-AAAK
TSPAN6      7.5636463     7.705439     9.975045    10.110718     9.881960
TNMD        0.4272843     1.061776     5.353718     1.164271     2.506120
DPM1        9.0367945     9.662019     9.919403     8.859174     8.928912
SCYL3       8.3493520    10.607275     8.395877     9.103957     8.704889
C1orf112    6.9691735     8.106778     7.922947     7.776269     7.495535

What step are we missing before calculating PCA (note: we’re still interested in looking at the relationships between samples)?

# Fill in the command here to "rotate" the matrix to get our data ready for PCA
brca_pca_mat <- t(brca_log_mat) # TODO: DELETE BEFORE ACTIVITY

Calculating PCA

Implementing the PCA algorithm in R is now straightforward.

This will take a bit longer than the PS-ON example above — take a moment to ask questions or look ahead to the next steps!

brca_pca <- prcomp(brca_pca_mat, scale = FALSE)
dim(brca_pca$x)
[1] 1083 1083
brca_pca$x[1:5,1:5]
                   PC1       PC2        PC3        PC4        PC5
TCGA-3C-AAAU -56.61926  41.33468  -5.566986  18.938057 -36.728384
TCGA-3C-AALI  12.26897  18.51457 -35.538525 -45.265949   1.104241
TCGA-3C-AALJ -11.48759  19.85753 -40.409893 -35.419112 -18.948293
TCGA-3C-AALK -12.88121 -35.45116 -36.698849   4.173563  24.755150
TCGA-4H-AAAK -13.95289 -31.62926 -20.154610  16.568536  22.762430

Let’s look at the percent variance explained (or accounted for) by the principal components.

pve <- 100 * brca_pca$sdev^2 / sum(brca_pca$sdev^2)
plot(pve, type = "o", cex = 0.8,
     ylab = "PVE", xlab = "Principal Component", col = "blue")

Because we have so many PCs (the number of samples - 1 = 1082), we’ll consider only the first 100.

par(mfrow = c(1, 2))
plot(pve, type = "o", cex = 0.8, 
     ylab = "PVE", xlab = "Principal Component", 
     xlim = c(1, 100), col = "blue")
plot(cumsum(pve), type = "o", cex = 0.8,
     ylab = "Cumulative PVE", xlab = "Principal Component", 
     xlim = c(1, 100), col = "brown3")

Still a bit hard to eyeball from the plots… We can also print out the percentages to get a better sense for how they’re increasing:

data.frame(PC = 1:10,
           PVE = round(pve[1:10], 0),
           Cumulative.PVE = round(cumsum(pve)[1:10], 0))

In contrast to the PS-ON PCA where the first two PCs account for 40% of the variance, the first two PCs for the TCGA BRCA analysis account for only 20% of the variance. However, for large gene expression data sets, this is considered pretty good, and often PC1 and PC2 carry a fair amount of useful information.

plot(brca_pca$x[, c(1, 2)], cex = 0.8)

We see roughly two clusters. Do these have biologial meaning? What clinical information do we have on the TCGA BRCA samples?

names(brca_clinical_df)
 [1] "bcr_patient_uuid"                                       "bcr_patient_barcode"                                   
 [3] "acronym"                                                "gender"                                                
 [5] "vital_status"                                           "days_to_birth"                                         
 [7] "days_to_last_followup"                                  "days_to_initial_pathologic_diagnosis"                  
 [9] "age_at_initial_pathologic_diagnosis"                    "icd_10"                                                
[11] "tissue_retrospective_collection_indicator"              "icd_o_3_histology"                                     
[13] "tissue_prospective_collection_indicator"                "history_of_neoadjuvant_treatment"                      
[15] "icd_o_3_site"                                           "tumor_tissue_site"                                     
[17] "new_tumor_event_after_initial_treatment"                "radiation_therapy"                                     
[19] "race"                                                   "prior_dx"                                              
[21] "ethnicity"                                              "informed_consent_verified"                             
[23] "person_neoplasm_cancer_status"                          "patient_id"                                            
[25] "year_of_initial_pathologic_diagnosis"                   "histological_type"                                     
[27] "tissue_source_site"                                     "form_completion_date"                                  
[29] "pathologic_T"                                           "pathologic_M"                                          
[31] "pathologic_N"                                           "system_version"                                        
[33] "pathologic_stage"                                       "lymph_node_examined_count"                             
[35] "initial_pathologic_diagnosis_method"                    "number_of_lymphnodes_positive_by_he"                   
[37] "anatomic_neoplasm_subdivision"                          "menopause_status"                                      
[39] "margin_status"                                          "breast_carcinoma_progesterone_receptor_status"         
[41] "breast_carcinoma_surgical_procedure_name"               "axillary_lymph_node_stage_method_type"                 
[43] "breast_carcinoma_estrogen_receptor_status"              "lab_proc_her2_neu_immunohistochemistry_receptor_status"
[45] "lab_procedure_her2_neu_in_situ_hybrid_outcome_type"    

Ok, so we have quite a lot of information. Let’s consider estrogen receptor (ER) status. We considered this in the lecture for a different brca set. I am including lines to pull the progeterone receptor (PR) and the receptor tyrosine kinase erbB-2 (HER2) for your examination later.

ER_status <- brca_clinical_df$breast_carcinoma_estrogen_receptor_status

What values are encompassed by estrogen receptor status in the brca clinical features data frame? How many samples are labeled with each value?

# one way to check this is to loop through each unique value and then count
# the occurrences of each
unique_status <- unique(ER_status)
for (i in 1:length(unique_status)) {
  print(c(unique_status[i], length(ER_status[ER_status == unique_status[i]])))
}
[1] "Positive" "797"     
[1] "Negative" "236"     
[1] "[Not Evaluated]" "48"             
[1] "Indeterminate" "2"            
# pro tip: another way to do this is with the `table()` function!
table(ER_status)
ER_status
[Not Evaluated]   Indeterminate        Negative        Positive 
             48               2             236             797 

Visualizing patterns and groups with PCA

We’ll color the samples in the PC1/PC2 projection plot according to ER status.

colors <- color_by_type(unique(ER_status))
status <- unique(ER_status)
plot(brca_pca$x[, c(1, 2)],
     col = scales::alpha(color_by_type(ER_status), 0.7),
     main = "ER status", pch = 19, cex = 0.8
)
legend("bottomleft", fill = colors, legend = status, cex = 0.8)


mini-DREAM Challenge

QUESTION: Does triple-negative status correspond to major variation among TCGA breast cancer samples? If so, do we expect this variation to relate positively or negatively to metastatic outcomes? What genes individual genes might be driving this variation?

Triple-negative breast cancers are generally diagnosed based on the lack of three receptors known to fuel most breast cancers: estrogen receptors (ER), progesterone receptors (PR) and human epidermal growth factor receptor 2 (HER2). The most successful treatments for breast cancer target these receptors.

To investigate these questions, follow the instructions and provide your answers in the 3 “parts” below for these variables:

PART 1: Inspecting the other 2/3 of “triple”

Take a look at how the expression of PR and HER2 relate to the first principal component (PC1) in the data.

# Replace the blanks with the variable name for PR status (remember that PR
# is short for "progesterone receptor")
PR_status <- brca_clinical_df$breast_carcinoma_progesterone_receptor_status
colors <- color_by_type(unique(PR_status))
status <- unique(PR_status)
plot(brca_pca$x[, c(1, 2)],
     col = scales::alpha(color_by_type(PR_status), 0.7),
     main = "PR status", pch = 19, cex = 0.8
)
legend("bottomleft", fill = colors, legend = status, cex = 0.8)

# Replace the blanks with the variable name for HER2 status. Note: we're
# specifically looking for HER2 status determined from immunohistochemistry 
# lab processing
HER2_status <- brca_clinical_df$lab_proc_her2_neu_immunohistochemistry_receptor_status
colors <- scales::alpha(color_by_type(unique(HER2_status)))
status <- unique(HER2_status)
plot(brca_pca$x[, c(1, 2)],
     col = scales::alpha(color_by_type(HER2_status), 0.7),
     main = "HER2 status", pch = 19, cex = 0.8
)
legend("bottomleft", fill = colors, legend = status, cex = 0.8)

Which of the two PC2 vs. PC1 plots is most informative? One of the two receptors should show better separation of “Positive” and “Negative” status along the PC1 axis.

# fill in "PR" or "HER2"
my_pc1_receptor <- "PR"

PART 2: Triple-negative status and outcome

First, check your answer! If the expression status of one of these two genes is less informative, then a “double negative” status based on only ER and the more informative gene should show greater separation than the when all 3 genes are included.

less_informative <- HER2_status # fill in "HER2_status" or "PR_status" 
more_informative <- PR_status # fill in "HER2_status" or "PR_status"
triple_negative <- (
  ER_status == "Negative" &  
  more_informative == "Negative" &
  less_informative == "Negative"
)
double_negative <- (
  ER_status == "Negative" &  
  more_informative == "Negative"
)
par(mfrow = c(1, 2))
colors_trip <- scales::alpha(color_by_type(unique(triple_negative)))
status_trip <- unique(triple_negative)
plot(brca_pca$x[, c(1, 2)],
     col = scales::alpha(color_by_type(triple_negative), 0.5),
     main = "Triple-negative status", pch = 19, cex = 0.8
)
legend("bottomleft", fill = colors_trip, legend = status_trip, cex = 0.8)
colors_doub <- scales::alpha(color_by_type(unique(double_negative)))
status_doub <- unique(double_negative)
plot(brca_pca$x[, c(1, 2)],
     col = scales::alpha(color_by_type(double_negative), 0.5),
     main = "Double-negative status", pch = 19, cex = 0.8
)
legend("bottomleft", fill = colors_doub, legend = status_doub, cex = 0.8)

Assume that PC1 is positively correlated with metastasis (we don’t know for sure yet whether it is or not). If that were the case, would “triple negative” be more or less likely to be associated with metastatic samples?

# fill in "more" or "less"
my_tripleneg_met_association <- "more" # TODO: REPLACE WITH "" FOR ACTIVITY

PART 3: Connecting PC1 to individual genes

PCA calculates loading values that rank, in a relative sense, how important genes are to a particular principal component. Look again at the information PCA provides:

names(brca_pca)
[1] "sdev"     "rotation" "center"   "scale"    "x"       

The rotation element contains loading values for each principal component (column) on each gene (rows). Let’s create a new variable to store these values.

# Create a matrix of loading values
loading_values_mat <- brca_pca$rotation

For example, the first column of the loading value matrix contains the loading values for PC1 on each gene. Sort the loading values for PC1 in increasing order; these have the most negative values for PC1.

# Sort and inspect PC1 loading values
pc1_loadings <- loading_values_mat[, 1]
pc1_loadings_sorted <- sort(pc1_loadings, decreasing = FALSE)
# Get the first 6 values from `pc1_loadings_sorted`
head(pc1_loadings_sorted)
       AGR3        TFF1        ESR1        TFF3     C1orf64        AGR2 
-0.06289317 -0.05814906 -0.05189125 -0.05133283 -0.05036492 -0.04952857 

You should see the following output.

       AGR3        TFF1        ESR1        TFF3     C1orf64        AGR2
-0.06289317 -0.05814906 -0.05189125 -0.05133283 -0.05036492 -0.04952857

Consider the first few genes. Search for one in the Human Protein Atlas (HPA), https://www.proteinatlas.org/. Click on the “pathology” link to learn more about this gene’s role in different cancers. Is the gene highly expressed in breast cancer? Does it appear to be prognostic of breast cancer?

# Enter the name of your gene here. It should match one of the 6 names above.
my_hpa_gene <- "TFF1" # TODO: REPLACE WITH "" FOR ACTIVITY
# Look at the section named "RNA EXPRESSION OVERVIEW". Is breast cancer one
# of the diagnoses listed as "tissue enhanced"? Fill in "yes" or "no".
my_hpa_is_enchanced <- "yes" # TODO: REPLACE WITH "" FOR ACTIVITY
# Look at the section named "PROGNOSTIC SUMMARY". Is the gene reported to be a
# prognostic marker for breast cancer? Fill in "yes" or "no".
my_hpa_is_prognostic <- "yes" # TODO: REPLACE WITH "" FOR ACTIVITY

Submitting the prediction

You’re now ready to submit the prediction. Just run the chunk below, a file with your prediction will be uploaded to Synapse and submitted to the challenge. You’ll be able to see the results of your prediction on the mini-DREAM scoreboards, with the submission ID that gets printed below.

library(synapser)
# If you didn't submit an answer for a previous module, remove the '#'s
# on the next 2 lines and enter your Synapse username and password
# before running this chunk.
# synLogin('my_synapse_username', 'my_synapse_password',
#          rememberMe = TRUE, silent = TRUE)
# If you submitted for Module 0, 1, or 2, this should work.
synLogin(silent = TRUE)
NULL
submission <- submit_module_answers(module = 3)
################################################## Uploading file to Synapse storage ##################################################
Uploading [--------------------]0.00%   0.0bytes/113.0bytes  jaeddy_activity-3.yml     
Uploading [####################]100.00%   113.0bytes/113.0bytes (225.8bytes/s) jaeddy_activity-3.yml Done...    

Successfully submitted file: 'jaeddy_activity-3.yml'
... stored as 'syn12617495'
Submission ID: '9678270

Congrats — you’ve reached the end of Module 3! You can now return to the mini-DREAM Challenge site on Synapse.


Bonus: Image compression with PCA

Just for fun, we’ve included a little activity to demonstrate how PCA (and dimensionality reduction) can be used to compress an image file. While this isn’t exactly how common compression formats work — e.g., JPEG doesn’t use PCA — the idea of capturing and preserving useful information (and discarding less useful information) is the same. Feel free to check it out!

WARNING: the code below might look strange compared to the R you’ve encountered in other modules. The style and some of the specific packages/functions are generally referred to as part of the “tidyverse” (which you can look up if you want to learn more). Don’t worry too much about understanding the code here — you won’t be tested on it!

library(imager)
library(broom)
library(tidyverse)

Get Synapse IDs for TIFF images of the MDA-MB-231 breast cancer cell line from the motility study (from a single plate in the “Hyaluronic Acid Fibronectin” condition).

img_file <- file.path(data_dir, "MDAMB231_plate_1_timelapse_61_001.tif")
cell_img <- load.image(img_file)
cell_img
Image. Width: 1280 pix Height: 1024 pix Depth: 1 Colour channels: 1 

Viewing a cell image in R

Here’s what the image looks like, as downloaded.

plot(cell_img)

If we save the file as a JPEG (with maximum quality), we can get a starting size.

save.image(cell_img, "img.jpeg", quality = 1)
fs::file_info("img.jpeg") %>% 
  pluck("size")
624K

Getting PCs for the image matrix

Convert the image to a matrix and compute principal components (PCs) with prcomp.

cell_img_mat <- as.matrix(cell_img)
cell_img_pca <- prcomp(cell_img_mat)

The number of PCs (not surprisingly) is equal to the dimensions of the original image.

cell_img_pca_tidy <- tidy(cell_img_pca, matrix = "pcs") 
package ‘bindrcpp’ was built under R version 3.4.4
nrow(cell_img_pca_tidy)
[1] 1024

Let’s check out the percent variance explained by the top 50 PCs.

cell_img_pca_tidy %>% 
  slice(1:50) %>%
  ggplot(aes(x = PC, y = percent)) +
  geom_col(alpha = 0.7)

A fair amount of variance in the image data is explained by PC1.

Using PC projections to compress the image

I’ll create a function to pick out the top num_pcs to create a compressed version of the image. Plotting the reconstructed image from PC1 only:

compress_img <- function(img_pca, num_pcs) {
  img_pca$x[, 1:num_pcs] %*% t(img_pca$rotation[, 1:num_pcs]) %>% 
    t() %>% 
    as.cimg()
}
compress_img(cell_img_pca, num_pcs = 1) %>% 
  plot()

So, a certainly a hint of the original… (yes, the image has been rotated 90 degrees — let’s not worry about that for now)

What about using the top 10 PCs?

compress_img(cell_img_pca, num_pcs = 10) %>% 
  plot()

Approaching the compression a bit more systematically, I can figure out how many PCs would be needed to capture 75% of the variation (~information) in the image data.

cell_img_pca_tidy %>% 
  ggplot(aes(x = PC, y = cumulative)) +
  geom_col(alpha = 0.7) +
  geom_vline(aes(xintercept = max(PC[cumulative <= 0.75])),
             linetype = 2, size = 1, colour = "blue3")

pcs_75 <- cell_img_pca_tidy %>% 
    filter(cumulative <= .75) %>%
    top_n(1, PC) %>%  
    pluck("PC")
pcs_75
[1] 45

That’s only 45 PCs (out of 1024) to reconstruct a fairly accurate version of the image!

compress_img(cell_img_pca, num_pcs = pcs_75) %>% 
  plot()

I can now save the compressed image as a JPEG to see the resulting file size.

cell_img_compressed <- compress_img(cell_img_pca, num_pcs = pcs_75)
save.image(cell_img_compressed, "img_compressed.jpeg", quality = 1)
fs::file_info("img_compressed.jpeg") %>% 
  pluck("size")
396K

That’s about 60% of the original file size. Success!


---
title: "Understanding data in reduced dimensions"
author: "Diana Murray"
date: "June 25, 2018"
output: html_notebook
editor_options: 
  chunk_output_type: inline
---

```{r setup, include=FALSE, message=FALSE}

knitr::opts_chunk$set(echo = TRUE)

set_basepath <- function(type = c("data", "R")) {
  if (stringr::str_length(Sys.which("rstudio-server"))) {
    file.path("/home/shared", type)
  } else {
    here::here(type)
  }
}

data_dir <- set_basepath("data")
scripts_dir <- set_basepath("R")
source(file.path(scripts_dir, "submission_helpers.R"))
```

# About this activity

This module will introduce you to concepts and approaches to working with high-dimensional data — especially the kind you might encounter when studying gene expression in systems biology applications. Similar to clustering, principal component analysis (PCA) is an unsupervised learning approach that we can use to visualize and analyze data in *reduced* dimensions. We'll show you how to calculate and inspect the outputs of PCA, and also how you can check for informative (or problematic) patterns.

---

# Loading and inspecting the data

Run the chunk below to pre-load and format some data that you'll be using for the activity below. Feel free to review the individual steps (**hint:** they're similar to the transformations we applied in Module 1), but otherwise we'll jump right into inspecting and working with the data!

```{r}
load(file.path(data_dir, "pson_expr_tpm_df.RData"))
load(file.path(data_dir, "pson_expr_gene_info.RData"))
load(file.path(data_dir, "pson_motility_tidy_df.RData"))

pson_pca_inputs_file <- file.path(data_dir, "pson_pca_inputs.RData")
if (!file.exists(pson_pca_inputs_file)) {
  pson_expr_tpm_df2 <- merge(x = gene_df, y = pson_expr_tpm_df, 
                             by.x = "gene_id", by.y = "gene_id")
  pson_tpm_mat <- as.matrix(pson_expr_tpm_df2[, -c(1:5)])
  rownames(pson_tpm_mat) <- pson_expr_tpm_df2$symbol
  pson_logtpm_mat <- log2(1 + pson_tpm_mat)
  pson_pca_mat <- t(pson_logtpm_mat)
  save(list = c("pson_pca_mat", "pson_logtpm_mat"), 
       file = pson_pca_inputs_file)
} else {
  load(pson_pca_inputs_file)
}
```

Check out `pson_expr_tpm_df`:

```{r}
# How do you see what is included in this data frame? Run the `str()` command. 
 
```

```{r}
# What are the sample names in `pson_expr_tpm_df`?
colnames(pson_expr_tpm_df)[-1] 

# How are genes identified?
pson_expr_tpm_df$gene_id[1:5] 
```

```{r}
# Andrew taught us many ways to inspect the data. Try this one.
View(pson_expr_tpm_df[1:1000, ]) # only view the first 1000 rows so we don't
                                 # make your browser too unhappy...
```

As you learned, the gene IDs are not very informative — they come from a database called **Ensembl**. More commonly, we would use the **Hugo gene symbols** (https://www.genenames.org/). These are the gene names you are more used to seeing (MYC, BRCA1, etc). We have this information stored in another file.

Check out the first few rows of `gene_df`:

```{r}
# Run the command to view the first 6 rows of `gene_df`
____(gene_df) # replace the blanks!
```

We now have a new variable called `pson_pca_mat` that we'll use for the examples below. Take a look at the first few rows and columns of this matrix, and try to figure out what changes were made to go from `pson_expr_tpm_df` and `gene_df` to what you see. Compare to `pson_logtpm_mat` and see what else is different.

```{r}
# Take a look at the first few rows and columns of `pson_pca_mat`


# Take a look at the first few rows and columns of `pson_logtpm_mat`

```

Besides any normalization (e.g., "log" of expression values), translation (changing gene IDs to symbols), or format conversion (data frame to matrix), we've also **transposed** or rotated the data such that genes are now represented by columns and samples by rows. More on why we did this next!

---

# Getting started with PCA with PS-ON cell line data

We will use the function **`prcomp()`** to run the principal component analysis (PCA) algorithm. We're interested in checking out the variability and similarity of samples (cell lines) in our data based on their gene expression patterns. To do this, we need to give `prcomp()` a matrix as where the samples are the *rows*, and the features (transcript counts) are the *columns*. If we hadn't transposed, PCA would still work — but it would instead be telling us about the relationship *between* genes.

## Calculating PCA

We want to make sure that we can connect our expression data to motility measurements. I happen to know that for one sample in `pson_pca_mat`, we're missing motility data (I cheating a bit and checked in advance, but it makes life easier given our time constraints).

```{r}
# let's locate and remove the sample with no motility measurement data
pca_mat_samples <- rownames(pson_pca_mat)
missing_row <- which(!(pca_mat_samples %in% pson_motil_tidy_df$sample))
missing_row
pson_pca_mat <- pson_pca_mat[-missing_row, ]   
```

As described in our lecture, there are two additional things for us to consider:

1. Whether the data should be mean-centered around 0. 
2. Whether the data should be scaled so that the standard deviation for each gene is 1. 

Our features have the same units, and it is often the case that they do not need to be scaled; this is somewhat subjective. The `prcomp()` function automatically mean-centers and scales the data for us by default. Because we want to preserve the natural differences in expression variance between genes, we'll elect *not* to scale in this case:

```{r}
pson_pca <- prcomp(pson_pca_mat, scale = FALSE)
```

Exciting, right?  But now we can get to our exploratory analysis. What goodies does **`pson_pca`** hold for us?

```{r}
names(pson_pca)
```

We discussed these in the lecture. To summarize:

+ `sdev`: The standard deviation of the principal components
+ `rotation`: The matrix of feature loadings (weights)
+ `center`: The feature means that prcomp() subtracted from each feature
+ `scale`: The feature standard deviations used in scaling
+ `x`: Coordinates of the cell lines projected onto the principal components

```{r}
dim(pson_pca$sdev)
dim(pson_pca$rotation)
dim(pson_pca$center)
dim(pson_pca$scale)
dim(pson_pca$x)
```

Are these dimensions what you would expect?

```{r}
pson_pca$rotation[1:5, 1:5]
```

```{r}
pson_pca$x[1:5, 1:5]
```


```{r}
length(pson_pca$sdev)
length(pson_pca$center)
length(pson_pca$scale)
```

Check the values out! Why does `scale` have length 1?

In this module, we are interested in `sdev` to calculate the variation explained by the principal components, and `x`, which provides the coordinates of the samples (cell lines) onto the principal components.

Has PCA reduced the dimensionality of our data? Let's examine the importance of the principal components by ploting the percentage of variance explained. 

```{r}
pve <- 100 * pson_pca$sdev^2 / sum(pson_pca$sdev^2)

par(mfrow = c(1, 2))

plot(pve,
     type = "o", cex = 0.8,
     ylab = "PVE", xlab = "Principal Component", col = "blue"
)

# `cumsum` is a handy function to calculate the "cumulative sum" of a vector;
# for example:
# > 1:5
# [1] 1 2 3 4 5
# > cumsum(1:5)
# [1]  1  3  6 10 15
plot(cumsum(pve),
     type = "o", cex = 0.8,
     ylab = "Cumulative PVE", xlab = "Principal Component", col = "brown3"
)
```

## Visualizing patterns and groups with PCA

The first two PCs account for a lot of the variance, > 40%.  The first 10 PCs account for almost all of the variance.  Let's see if the first two PCs, PC1 and PC2, reveal any patterns in our data.

```{r}
plot(pson_pca$x[, 1:2],
     pch = 19, cex = 0.8,
     xlab = "Projection onto PC1", ylab = "Projection onto PC2"
)
```

Each point gives the coordinates of a sample/cell line in PC space. It's a nice pattern with some potential. Does is have meaning for our data set? Let's color the points by cell line type and diagnosis.

```{r}
cell_speeds_df <- subset(pson_motil_tidy_df, summary_metric == "speed_um_hr")

names(cell_speeds_df)
```

We will color by `cellLine` and `diagnosis`, and it is possible to explore other relationships as well.

Here is a simple function for coloring points by type — we'll use another one of the "*Color Brewer*" palettes that we learned about in Module 2. (**note:** running this next chunk won't do anything by itself, but we'll use the new `color_by_type()` function in several places below)

```{r}
color_by_type <- function(vec) {
  nn <- length(unique(vec))
  cols <- RColorBrewer::brewer.pal(nn, "Set1")

  return(cols[as.numeric(as.factor(vec))])
}
```

First, let's color the points in the PC2 vs. PC1 space according to diagnosis. We'll adjust the transparency a bit with the `alpha()` function, such that overlapping points are easier to see.

```{r}
colors <- color_by_type(unique(cell_speeds_df$diagnosis))
diag <- unique(cell_speeds_df$diagnosis)
plot(pson_pca$x[, c(1, 2)],
     col = scales::alpha(color_by_type(cell_speeds_df$diagnosis), 0.8),
     main = "Diagnosis", pch = 19, cex = 0.8
)

legend("bottomright", fill = colors, legend = diag, cex = 0.8)
```

> TIMEOUT
> This is a fairly important point that we should take a moment to consider. Running PCA and seeing our data separate out so cleanly like this can be exciting, but it should make us pause and take a closer look. We need to check the history of how the data were generated to make sure that PCA patterns reflect *biological differences* and not *batch effects*. Can you think of other reasons why different cancer types might appear so distinct in (PC1, PC2) space?

OK — scary word of caution aside, let's get back to the activity...

We can use the same approach to color points based on cell line name.

```{r}
colors <- color_by_type(unique(cell_speeds_df$cellLine))
cell <- unique(cell_speeds_df$cellLine)
plot(pson_pca$x[, c(1, 2)],
     col = scales::alpha(color_by_type(cell_speeds_df$cellLine), 0.8),
     main = "Cell Line", pch = 19, cex = 0.8
)

legend("bottomright", fill = colors, legend = cell, cex = 0.8)
```

T-47D and MDA-MB-231, two breast cancer cell lines form distinct well-separated groups. Can you think of reasons (besides batch effects) why these points might not be grouped together? Note that the two colon cancer cell lines (SW620 and SW480) *are* grouped together.

Sometimes we can pick out patterns and information more easily by encoding multiple layers of data in our plot. For example, we can simultaneously look at cell line (color) and diagnosis (label) to more easily see how points relate to each other according to these variables.

```{r}
colors <- color_by_type(unique(cell_speeds_df$cellLine))
cell <- unique(cell_speeds_df$cellLine)
plot(pson_pca$x[, c(1, 2)],
     main = "Cell Line", pch = 19, cex = 0
)
text(pson_pca$x[, c(1, 2)], 
     labels = gsub(" Cancer", "", cell_speeds_df$diagnosis),
     col = scales::alpha(color_by_type(cell_speeds_df$cellLine), 0.8),
     pos = 3, offset = 0, cex = 0.8)

legend("bottomright", fill = colors, legend = cell, cex = 0.8)
```

Again, we can see that the two breast cancer cell lines do NOT cluster together. In fact, the MDA-MB-231 cell line appears more closely related to several other cancer types than to the T-47D cell line — why might this be?

---

# PCA with patient data from TCGA

## Load the data

We'll shortcut things here again, and let you run the chunk below to get the data in the right format for the next steps.

```{r}
load(file.path(data_dir, "tcga_brca_expr_gene_info.RData"))
load(file.path(data_dir, "tcga_brca_expr_norm_df.RData"))
load(file.path(data_dir, "tcga_brca_clinical_df.RData"))


x1 <- names(brca_expr_norm_df)[-1]
x2 <- brca_clinical_df[["bcr_patient_barcode"]]
sample.names <- intersect(x1, x2)
sample.names <- sort(sample.names)

I <- match(sample.names, names(brca_expr_norm_df))
I <- c(1, I)
brca_expr_norm_df <- brca_expr_norm_df[, I]

I <- match(sample.names, brca_clinical_df[["bcr_patient_barcode"]])
brca_clinical_df <- brca_clinical_df[I, ]

brca_expr_norm_df2 <- merge(x = gene_df, y = brca_expr_norm_df,
                            by.x = "gene_id",  by.y = "gene_id")

brca_expr_mat <- as.matrix(brca_expr_norm_df2[, -c(1:5)])
rownames(brca_expr_mat) <- brca_expr_norm_df2$symbol

brca_log_mat <- log2(1 + brca_expr_mat)
```

Take a look at `brca_log_mat`: 

```{r}
brca_log_mat[1:5, 1:5]
```

What step are we missing before calculating PCA (**note:** we're still interested in looking at the relationships between *samples*)?

```{r}
# Fill in the command here to "rotate" the matrix to get our data ready for PCA
brca_pca_mat <- ____(brca_log_mat) 
```


## Calculating PCA

Implementing the PCA algorithm in R is now straightforward. 

> This will take a bit longer than the PS-ON example above — take a moment to ask questions or look ahead to the next steps!

```{r}
brca_pca <- prcomp(brca_pca_mat, scale = FALSE)

dim(brca_pca$x)
brca_pca$x[1:5,1:5]
```

Let's look at the percent variance explained (or accounted for) by the principal components.

```{r}
pve <- 100 * brca_pca$sdev^2 / sum(brca_pca$sdev^2)
plot(pve, type = "o", cex = 0.8,
     ylab = "PVE", xlab = "Principal Component", col = "blue")
```

Because we have so many PCs (the number of samples - 1 = 1082), we'll consider only the first 100.

```{r}
par(mfrow = c(1, 2))

plot(pve, type = "o", cex = 0.8, 
     ylab = "PVE", xlab = "Principal Component", 
     xlim = c(1, 100), col = "blue")

plot(cumsum(pve), type = "o", cex = 0.8,
     ylab = "Cumulative PVE", xlab = "Principal Component", 
     xlim = c(1, 100), col = "brown3")
```

Still a bit hard to eyeball from the plots... We can also print out the percentages to get a better sense for how they're increasing:

```{r}
data.frame(PC = 1:10,
           PVE = round(pve[1:10], 0),
           Cumulative.PVE = round(cumsum(pve)[1:10], 0))
```

In contrast to the PS-ON PCA where the first two PCs account for 40% of the variance, the first two PCs for the TCGA BRCA analysis account for only 20% of the variance. However, for large gene expression data sets, this is considered pretty good, and often PC1 and PC2 carry a fair amount of useful information.

```{r}
plot(brca_pca$x[, c(1, 2)], cex = 0.8)
```

We see roughly two clusters. Do these have biologial meaning?  What clinical information do we have on the TCGA BRCA samples?

```{r}
names(brca_clinical_df)
```

Ok, so we have quite a lot of information. Let's consider estrogen receptor (ER) status. We considered this in the lecture for a different brca set. I am including lines to pull the progeterone receptor (PR) and the receptor tyrosine kinase erbB-2 (HER2) for your examination later.

```{r}
ER_status <- brca_clinical_df$breast_carcinoma_estrogen_receptor_status
```

What values are encompassed by estrogen receptor status in the brca clinical features data frame? How many samples are labeled with each value?

```{r}
# one way to check this is to loop through each unique value and then count
# the occurrences of each
unique_status <- unique(ER_status)
for (i in 1:length(unique_status)) {
  print(c(unique_status[i], length(ER_status[ER_status == unique_status[i]])))
}
```

```{r}
# pro tip: another way to do this is with the `table()` function!
table(ER_status)
```


## Visualizing patterns and groups with PCA

We'll color the samples in the PC1/PC2 projection plot according to ER status.

```{r}
colors <- color_by_type(unique(ER_status))
status <- unique(ER_status)
plot(brca_pca$x[, c(1, 2)],
     col = scales::alpha(color_by_type(ER_status), 0.7),
     main = "ER status", pch = 19, cex = 0.8
)

legend("bottomleft", fill = colors, legend = status, cex = 0.8)
```

---

# mini-DREAM Challenge  

**QUESTION:** Does triple-negative status correspond to major variation among TCGA breast cancer samples? If so, do we expect this variation to relate positively or negatively to metastatic outcomes? What genes individual genes might be driving this variation?

Triple-negative breast cancers are generally diagnosed based on the lack of three receptors known to fuel most breast cancers: estrogen receptors (ER), progesterone receptors (PR) and human epidermal growth factor receptor 2 (HER2). The most successful treatments for breast cancer target these receptors.

To investigate these questions, follow the instructions and provide your answers in the 3 "parts" below for these variables:

+ `my_pc1_receptor`
+ `my_tripleneg_met_association`
+ `my_hpa_gene`
+ `my_hpa_is_enchanced`
+ `my_hpa_is_prognostic`


## PART 1: Inspecting the other 2/3 of "triple"

Take a look at how the expression of PR and HER2 relate to the first principal component (PC1) in the data.

```{r}
# Replace the blanks with the variable name for PR status (remember that PR
# is short for "progesterone receptor")
PR_status <- brca_clinical_df$____

colors <- color_by_type(unique(PR_status))
status <- unique(PR_status)
plot(brca_pca$x[, c(1, 2)],
     col = scales::alpha(color_by_type(PR_status), 0.7),
     main = "PR status", pch = 19, cex = 0.8
)

legend("bottomleft", fill = colors, legend = status, cex = 0.8)
```

```{r}
# Replace the blanks with the variable name for HER2 status. Note: we're
# specifically looking for HER2 status determined from immunohistochemistry 
# lab processing
HER2_status <- brca_clinical_df$____

colors <- scales::alpha(color_by_type(unique(HER2_status)))
status <- unique(HER2_status)
plot(brca_pca$x[, c(1, 2)],
     col = scales::alpha(color_by_type(HER2_status), 0.7),
     main = "HER2 status", pch = 19, cex = 0.8
)

legend("bottomleft", fill = colors, legend = status, cex = 0.8)
```

Which of the two PC2 vs. PC1 plots is most informative? One of the two receptors should show better separation of "Positive" and "Negative" status along the PC1 axis.

```{r}
# fill in "PR" or "HER2"
my_pc1_receptor <- ""
```


## PART 2: Triple-negative status and outcome

First, check your answer! If the expression status of one of these two genes is less informative, then a "double negative" status based on only ER and the more informative gene should show greater separation than the when all 3 genes are included.


```{r, warning=FALSE}
less_informative <- ____ # fill in HER2_status or PR_status - no quotes!
more_informative <- ____ # fill in HER2_status or PR_status - no quotes!

triple_negative <- (
  ER_status == "Negative" &  
  more_informative == "Negative" &
  less_informative == "Negative"
)

double_negative <- (
  ER_status == "Negative" &  
  more_informative == "Negative"
)

par(mfrow = c(1, 2))

colors_trip <- scales::alpha(color_by_type(unique(triple_negative)))
status_trip <- unique(triple_negative)
plot(brca_pca$x[, c(1, 2)],
     col = scales::alpha(color_by_type(triple_negative), 0.5),
     main = "Triple-negative status", pch = 19, cex = 0.8
)
legend("bottomleft", fill = colors_trip, legend = status_trip, cex = 0.8)

colors_doub <- scales::alpha(color_by_type(unique(double_negative)))
status_doub <- unique(double_negative)
plot(brca_pca$x[, c(1, 2)],
     col = scales::alpha(color_by_type(double_negative), 0.5),
     main = "Double-negative status", pch = 19, cex = 0.8
)
legend("bottomleft", fill = colors_doub, legend = status_doub, cex = 0.8)
```

Assume that PC1 is *positively correlated* with metastasis (we don't know for sure yet whether it is or not). If that were the case, would "triple negative" be more or less likely to be associated with metastatic samples?

```{r}
# fill in "more" or "less"
my_tripleneg_met_association <- "" 
```

## PART 3: Connecting PC1 to individual genes

PCA calculates loading values that rank, in a relative sense, how important genes are to a particular principal component. Look again at the information PCA provides:

```{r}
names(brca_pca)
```

The `rotation` element contains loading values for each principal component (column) on each gene (rows). Let's create a new variable to store these values.

```{r}
# Create a matrix of loading values
loading_values_mat <- brca_pca$rotation
```

For example, the first column of the loading value matrix contains the loading values for PC1 on each gene.  Sort the loading values for PC1 in increasing order; these have the most negative values for PC1.

```{r}
# Sort and inspect PC1 loading values
pc1_loadings <- loading_values_mat[, 1]
pc1_loadings_sorted <- sort(pc1_loadings, decreasing = FALSE)


# Get the first 6 values from `pc1_loadings_sorted`
head(pc1_loadings_sorted)
```


> You should see the following output.

```
       AGR3        TFF1        ESR1        TFF3     C1orf64        AGR2
-0.06289317 -0.05814906 -0.05189125 -0.05133283 -0.05036492 -0.04952857
```

Consider the first few genes. Search for one in the Human Protein Atlas (HPA), https://www.proteinatlas.org/.  Click on the "pathology" link to learn more about this gene's role in different cancers. Is the gene highly expressed in breast cancer? Does it appear to be prognostic of breast cancer?   

```{r}
# Enter the name of your gene here. It should match one of the 6 names above.
my_hpa_gene <- "" 

# Look at the section named "RNA EXPRESSION OVERVIEW". Is breast cancer one
# of the diagnoses listed as "tissue enhanced"? Fill in "yes" or "no".
my_hpa_is_enchanced <- "" 

# Look at the section named "PROGNOSTIC SUMMARY". Is the gene reported to be a
# prognostic marker for breast cancer? Fill in "yes" or "no".
my_hpa_is_prognostic <- "" 
```


## Submitting the prediction

You're now ready to submit the prediction. Just run the chunk below, a file with your prediction will be uploaded to Synapse and submitted to the challenge. You'll be able to see the results of your prediction on the mini-DREAM scoreboards, with the submission ID that gets printed below.

```{r}
library(synapser)

# If you didn't submit an answer for a previous module, remove the '#'s
# on the next 2 lines and enter your Synapse username and password
# before running this chunk.
# synLogin('my_synapse_username', 'my_synapse_password',
#          rememberMe = TRUE, silent = TRUE)

# If you submitted for Module 0, 1, or 2, this should work.
synLogin(silent = TRUE)
submission <- submit_module_answers(module = 3)
```

Congrats — you’ve reached the end of **Module 3**! You can now return to the **mini-DREAM Challenge** site on Synapse.

---

# Bonus: Image compression with PCA

Just for fun, we've included a little activity to demonstrate how PCA (and dimensionality reduction) can be used to compress an image file. While this isn't exactly how common compression formats work — e.g., JPEG doesn't use PCA — the idea of capturing and preserving useful information (and discarding less useful information) is the same. Feel free to check it out!

> WARNING: the code below might look strange compared to the R you've encountered in other modules. The style and some of the specific packages/functions are generally referred to as part of the "tidyverse" (which you can look up if you want to learn more). Don't worry too much about understanding the code here — you won't be tested on it!

```{r, warning=FALSE, message=FALSE}
library(imager)
library(broom)
library(tidyverse)
```

Get Synapse IDs for TIFF images of the **MDA-MB-231** breast cancer cell line from the motility study (from a single plate in the "Hyaluronic Acid Fibronectin" condition).

```{r}
img_file <- file.path(data_dir, "MDAMB231_plate_1_timelapse_61_001.tif")
cell_img <- load.image(img_file)
cell_img
```

## Viewing a cell image in R

Here's what the image looks like, as downloaded.

```{r}
plot(cell_img)
```

If we save the file as a JPEG (with maximum quality), we can get a starting size.

```{r}
save.image(cell_img, "img.jpeg", quality = 1)
fs::file_info("img.jpeg") %>% 
  pluck("size")
```

## Getting PCs for the image matrix

Convert the image to a matrix and compute principal components (PCs) with `prcomp`.

```{r}
cell_img_mat <- as.matrix(cell_img)
cell_img_pca <- prcomp(cell_img_mat)
```

The number of PCs (not surprisingly) is equal to the dimensions of the original image.

```{r}
cell_img_pca_tidy <- tidy(cell_img_pca, matrix = "pcs") 
nrow(cell_img_pca_tidy)
```

Let's check out the percent variance explained by the top 50 PCs.

```{r}
cell_img_pca_tidy %>% 
  slice(1:50) %>%
  ggplot(aes(x = PC, y = percent)) +
  geom_col(alpha = 0.7)
```

A fair amount of variance in the image data is explained by PC1. 

## Using PC projections to compress the image

I'll create a function to pick out the top **`num_pcs`** to create a compressed version of the image. Plotting the reconstructed image from PC1 only:

```{r}
compress_img <- function(img_pca, num_pcs) {
  img_pca$x[, 1:num_pcs] %*% t(img_pca$rotation[, 1:num_pcs]) %>% 
    t() %>% 
    as.cimg()
}
compress_img(cell_img_pca, num_pcs = 1) %>% 
  plot()
```

So, a certainly a hint of the original... (yes, the image has been rotated 90 degrees — let's not worry about that for now)

What about using the top 10 PCs?

```{r}
compress_img(cell_img_pca, num_pcs = 10) %>% 
  plot()
```

Approaching the compression a bit more systematically, I can figure out how many PCs would be needed to capture 75% of the variation (~information) in the image data.

```{r}
cell_img_pca_tidy %>% 
  ggplot(aes(x = PC, y = cumulative)) +
  geom_col(alpha = 0.7) +
  geom_vline(aes(xintercept = max(PC[cumulative <= 0.75])),
             linetype = 2, size = 1, colour = "blue3")
```

```{r}
pcs_75 <- cell_img_pca_tidy %>% 
    filter(cumulative <= .75) %>%
    top_n(1, PC) %>%  
    pluck("PC")
pcs_75
```

That's only **`r pcs_75`** PCs (out of `r nrow(cell_img_pca_tidy)`) to reconstruct a fairly accurate version of the image!

```{r}
compress_img(cell_img_pca, num_pcs = pcs_75) %>% 
  plot()
```

I can now save the compressed image as a JPEG to see the resulting file size.

```{r}
cell_img_compressed <- compress_img(cell_img_pca, num_pcs = pcs_75)

save.image(cell_img_compressed, "img_compressed.jpeg", quality = 1)
fs::file_info("img_compressed.jpeg") %>% 
  pluck("size")
```

That's about 60% of the original file size. Success!

---